A Method for Clustering High-Dimensional Data Using 1D Random Projections
Authors
Abstract
Han, Sangchun. Ph.D., Purdue University, December 2014. A Method for Clustering High-Dimensional Data Using 1D Random Projections. Major Professor: Mireille Boutin.
Clustering high-dimensional data is more difficult than clustering low-dimensional data. The problem is twofold. First, there is an efficiency problem: the data size increases with the dimensionality. Second, there is an effectiveness problem: the mere existence of clusters in high-dimensional sample sets is questionable, because empirical samples rarely cluster together in a meaningful fashion. The current approach to addressing this issue is to seek clusters in embedded subspaces of the original space. However, as dimensionality increases, the cost of a naive exhaustive search over all subspaces grows exponentially, which makes its running time prohibitive. We propose an alternative approach to high-dimensional data clustering. Our solution is a top-down hierarchical clustering method that uses a binary tree of 1D random projections. Because real data tends to have a lot of structure, we show that a 1D random projection of real data captures some of that structure with high probability. More specifically, the structure manifests itself as a clear binary clustering in the projected (1D) data. Our approach is efficient because most of the computations are performed in 1D. To increase the efficiency of our method even further, we propose a fast 1D 2-means clustering method that takes advantage of the 1D space. Our method achieves better clustering quality and a lower run time than existing high-dimensional clustering methods.
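The abstract does not give the split criterion or the details of the fast 1D 2-means step, so what follows is only a minimal Python sketch of the general idea under stated assumptions, not the thesis's actual algorithm. The function names (`two_means_1d`, `split_node`, `cluster_tree`) and the parameters `n_tries`, `max_ratio`, and `min_size` are hypothetical. The test for a "clear binary clustering" is assumed here to be a low ratio of within-cluster to total squared error along the projection. The 1D 2-means step is solved exactly by sorting and scanning every split point with prefix sums.

```python
import numpy as np

def two_means_1d(x):
    """Exact 2-means in 1D (needs at least 2 points): sort, then score
    every split point with prefix sums. Returns (threshold, sse)."""
    xs = np.sort(x)
    n = xs.size
    csum, csum2 = np.cumsum(xs), np.cumsum(xs ** 2)
    k = np.arange(1, n)                       # candidate left-cluster sizes
    left_sse = csum2[k - 1] - csum[k - 1] ** 2 / k
    right_sum = csum[-1] - csum[k - 1]
    right_sse = (csum2[-1] - csum2[k - 1]) - right_sum ** 2 / (n - k)
    sse = left_sse + right_sse
    i = int(np.argmin(sse))
    return 0.5 * (xs[i] + xs[i + 1]), float(sse[i])

def split_node(X, idx, rng, n_tries=20, max_ratio=0.1):
    """Try several random 1D projections of the points X[idx].
    Returns a boolean mask for the best split, or None when no projection
    passes the (assumed) criterion sse / total_sse <= max_ratio."""
    best_ratio, best_mask = None, None
    for _ in range(n_tries):
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)                # random unit direction
        p = X[idx] @ w                        # 1D projection
        total = float(np.sum((p - p.mean()) ** 2))
        if total == 0.0:
            continue
        t, sse = two_means_1d(p)
        ratio = sse / total
        if best_ratio is None or ratio < best_ratio:
            best_ratio, best_mask = ratio, p <= t
    if best_ratio is None or best_ratio > max_ratio:
        return None
    return best_mask

def cluster_tree(X, min_size=5, seed=0):
    """Top-down binary splitting; each leaf of the tree becomes a cluster."""
    rng = np.random.default_rng(seed)
    labels = np.zeros(len(X), dtype=int)
    stack, next_label = [np.arange(len(X))], 0
    while stack:
        idx = stack.pop()
        mask = split_node(X, idx, rng) if idx.size >= 2 * min_size else None
        if mask is None or mask.all() or not mask.any():
            labels[idx] = next_label          # leaf: assign one cluster id
            next_label += 1
        else:
            stack.append(idx[mask])
            stack.append(idx[~mask])
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two well-separated Gaussian blobs in 50 dimensions (synthetic data).
    X = np.vstack([rng.normal(0, 1, (100, 50)), rng.normal(20, 1, (100, 50))])
    print(np.unique(cluster_tree(X)))
```

In this sketch, all per-node work after the projection `X[idx] @ w` happens on a 1D array, which is where the efficiency argument in the abstract comes from. The threshold `max_ratio` is an illustrative stand-in for whatever criterion the thesis uses to decide that a projection shows a clear two-cluster structure.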
Related Resources
High-Dimensional Unsupervised Active Learning Method
In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...
Fast and Robust Subspace Clustering Using Random Projections
Over the past several decades, subspace clustering has received increasing interest and made continuous progress. However, due to a lack of scalability and/or robustness, existing methods still have difficulty handling data that is simultaneously high-dimensional, massive, and grossly corrupted. To tackle the scalability and robustness issues simultane...
Iterative random projections for high-dimensional data clustering
In this text we propose a method that efficiently clusters high-dimensional data. The method builds on random projection and the K-means algorithm. The idea is to apply K-means several times, increasing the dimensionality of the data after each convergence of K-means. We compare the proposed algorithm on four high-dimensional datasets (image, text, and two synthetic) with K-means c...
Ensemble Fuzzy Clustering using Cumulative Aggregation on Random Projections
Random projection is a popular method for dimensionality reduction due to its simplicity and efficiency. In the past few years, cluster ensemble approaches based on random projection and fuzzy c-means have been developed for high-dimensional data clustering. However, they require large amounts of space for storing a big affinity matrix, and incur large computation time while clustering in this aff...
Projective clustering of high dimensional data
Clustering of high-dimensional data can be problematic, because the usual notions of distance or similarity break down for data in high dimensions. More specifically, it can be shown that, as the number of dimensions increases, the distance to the nearest point approaches the distance to the farthest one. Two approaches are common for dealing with this problem. The idea behind the first approac...
Publication date: 2016